We correct claims about lower bounds on mutual information (MI) between real-valued random 
variables made in A. Kraskov et al., Phys. Rev. E 69, 066138 (2004). We show that non-trivial lower 
bounds on MI in terms of linear correlations depend on the marginal (single variable) distributions. 
This is so in spite of the invariance of MI under reparametrizations, because linear correlations 
are not invariant under them. The simplest bounds are obtained for Gaussians, but the most 
interesting ones for practical purposes are obtained for uniform marginal distributions. The latter 
can be enforced in general by using the ranks of the individual variables instead of their actual 
values, in which case one obtains bounds on MI in terms of Spearman correlation coefficients. We 
show with gene expression data that these bounds are in general non-trivial, and the degree of their 
(non-)saturation yields valuable insight. 




Mutual information [1] between two objects is the difference between the combined lengths
of their individual descriptions and the length of a joint description, all descriptions
being "optimal", i.e. lossless and redundancy-free. In the framework of algorithmic
information theory [2], this is taken literally, i.e. the "objects" are sequences of letters
of some alphabet, and "description" means a compression of the sequence on some specified
but otherwise arbitrary universal Turing machine. In the framework of Shannon theory, in
contrast, we deal with random variables, and "description length" is to be understood as
the minimal average information needed to specify their realizations, given the probability
distributions.

In the following we shall only use the Shannon framework, but we shall not forget entirely
about individual objects. When confronted with them, we make some (explicit or implicit)
estimate about the probability distribution (assuming that the observed objects are in some
sense "typical"); computing their MI is actually a problem of statistical inference.

More precisely, consider two random variables $X$ and $Y$ with realizations $x, y$ and
probability densities $p_X(x)$ and $p_Y(y)$. For simplicity we shall assume that $x$ and $y$
are both scalars taken either from a finite interval or from the interval $[-\infty, \infty]$.
In both cases $p_X$ and $p_Y$ are normalized to 1. The joint distribution is $p(x,y)$. The
MI is then defined as

$$I(X:Y) = \int dx\,dy\; p(x,y)\,\log\frac{p(x,y)}{p_X(x)\,p_Y(y)} \tag{1}$$

where the base of the logarithm specifies the units in which information is measured. Bits
correspond to logarithm base 2.

From this one sees that $I$ is symmetrical, $I(X:Y) = I(Y:X)$, and positive definite:
$I(X:Y) \ge 0$, with $I(X:Y) = 0$ iff $X$ and $Y$ are strictly independent. Thus $I(X:Y)$ is
a universal measure of dependency, being non-zero whenever $X$ and $Y$ have anything in
common. This can also be seen in the following way: the (differential) entropy
$H(X) = -\int dx\, p_X(x) \log p_X(x)$ is the (negative) average log-likelihood of $x$, and

$$I(X:Y) = H(X) - H(X|Y) \tag{2}$$

is the average logarithm of the ratio between the posterior likelihood of $x$, conditioned
on the value $y$ of $Y$, and the unconditioned likelihood of $x$.

For the differential entropy, there is a well known upper bound in terms of the variance:
$H(X)$ is maximal for a Gaussian with the same variance as the data [1]. Indeed, this is
true also for multivariate distributions. In the appendix of [3], a formal proof based on
Lagrangian multipliers was given that analogous bounds hold also for the MI. According to
[3], a given covariance matrix implies a lower bound on the MI. Unfortunately, this proof
is wrong, and the claim made in [3] is incorrect. The error in [3] was subtle: the unique
solution of the Lagrangian variational problem was given correctly, but the fact was missed
that this solution is in general a saddle point, the correct bound being an infimum which is
not reached by any actual distribution (at least not by a distribution in the class admitted
in the variational problem).

Indeed, it is easily seen that the MI can be arbitrarily small for any value of the
correlation. Assume that the joint distribution is a sum of a delta peak with weight
$1-\epsilon$ centered at $(x,y) = (1,1)$ and a 2-d Gaussian with weight $\epsilon$ centered
at the origin,

$$p(x,y) = (1-\epsilon)\,\delta(x-1)\,\delta(y-1) + \frac{\epsilon}{2\pi\sigma^2}\, e^{-(x^2+y^2)/2\sigma^2}. \tag{3}$$

Then the correlation between $X$ and $Y$ varies between zero and one as the width $\sigma$
shrinks to zero, for any fixed $\epsilon > 0$. But the MI is bounded for all $\sigma$ by
$I(X:Y) \le -\epsilon\log\epsilon - (1-\epsilon)\log(1-\epsilon)$, which tends to zero as
$\epsilon \to 0$. Thus the MI can be arbitrarily close to zero, even when the correlation is
arbitrarily close to 1, although this is unlikely to appear in real applications, except
for outliers.
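
The counterexample can be checked with elementary arithmetic. The sketch below assumes that the Gaussian component of Eq. (3) is isotropic with standard deviation `sigma` in each coordinate; under that assumption the moments of the mixture give a Pearson correlation of $(1-\epsilon)/(1-\epsilon+\sigma^2)$. The function names are hypothetical and chosen only for this illustration.

```python
import math


def mixture_correlation(eps: float, sigma: float) -> float:
    """Pearson correlation of the delta-plus-Gaussian mixture.

    Assumes an isotropic Gaussian of width sigma centered at the origin
    (weight eps) and a delta peak at (1, 1) (weight 1 - eps).
    """
    mean = 1.0 - eps                                # <X> = <Y>
    second_moment = (1.0 - eps) + eps * sigma ** 2  # <X^2> = <Y^2>
    cross_moment = 1.0 - eps                        # <XY>, Gaussian part adds 0
    variance = second_moment - mean ** 2
    covariance = cross_moment - mean ** 2
    return covariance / variance


def mi_upper_bound(eps: float) -> float:
    """Entropy (in nats) of the mixture label, which bounds the MI from above."""
    return -eps * math.log(eps) - (1.0 - eps) * math.log(1.0 - eps)


if __name__ == "__main__":
    for sigma in (10.0, 1.0, 0.1, 0.001):
        print(sigma, mixture_correlation(0.01, sigma), mi_upper_bound(0.01))
```

For fixed `eps`, shrinking `sigma` drives the correlation towards 1 while the upper bound on the MI does not change, which is the point of the argument.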

It is the purpose of the present paper to present correct bounds replacing those given in
[3]. As we shall see, to obtain non-trivial bounds for the MI, one needs both the covariance
matrix and the marginal distributions. But the latter can be chosen arbitrarily to a large
extent, since $I(X:Y)$ as defined in Eq. (1) is invariant under homeomorphism. Let
$\phi(x)$ be a continuous and monotonic function, such that its inverse $\phi^{-1}(x)$ is
also continuous and monotonic, and let $X'$ be a random variable with realization
$x' = \phi(x)$ if $X$ has realization $x$. Then

$$p_X(x) = \left|\frac{d\phi(x)}{dx}\right|\, p_{X'}(x'), \tag{4}$$

and $I(X:Y) = I(X':Y)$. By symmetry, the same holds for homeomorphisms of $Y$.
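
The invariance can be verified in a few lines. As a sketch, write $p'(x',y)$ for the joint density of $(X',Y)$ (a symbol introduced here only for this derivation), so that $p(x,y) = |d\phi/dx|\,p'(x',y)$ in analogy to Eq. (4). The Jacobian then cancels inside the logarithm and is absorbed by the change of integration variable:

```latex
\begin{align*}
I(X':Y) &= \int dx'\,dy\; p'(x',y)\,
           \log\frac{p'(x',y)}{p_{X'}(x')\,p_Y(y)} \\
        &= \int dx\,dy\; \left|\frac{d\phi}{dx}\right| p'(x',y)\,
           \log\frac{|d\phi/dx|\;p'(x',y)}{|d\phi/dx|\;p_{X'}(x')\,p_Y(y)} \\
        &= \int dx\,dy\; p(x,y)\,\log\frac{p(x,y)}{p_X(x)\,p_Y(y)}
         \;=\; I(X:Y).
\end{align*}
```

No such cancellation occurs in $\int dx\,dy\; xy\, p(x,y)$, which is why linear correlations do change under the same transformation.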

This leads to the following strategy for obtaining 
bounds on I(X : Y): One first transforms X and Y in- 
dependently so that they have a given distribution, e.g. 
a Gaussian or a uniform distribution. Notice that the 
first and second moments in general will change during 
such a transformation. After that is done, one applies 
the bound suitable for the chosen marginal distributions. 

The case of Gaussian marginal distributions is the simplest to treat theoretically. In that
case the arguments given in the appendix of [3] apply, and the MI is bounded from below by
the MI of a joint Gaussian with the observed first & second moments. But this is not the
most practical choice, because it is non-trivial to transform any empirical distribution
into a Gaussian.

For practical purposes much more suitable is transformation to uniform distributions over
finite intervals, say $x' \in [-1,1]$ and $y' \in [-1,1]$. This transformation, which also
leads usually to improved MI estimates, is de facto achieved by using for $x'$ and $y'$
their normalized ranks. Assume that the empirical data consist of $N$ pairs $(x_i, y_i)$,
$i = 1, \ldots N$. Then the rank $r_i$ of $x_i$ is defined as the number of values $x_j$
which are less than or equal to $x_i$ (here we assume that all $x_i$ are different, as would
be true with probability 1 if $X$ is drawn from a continuous distribution; if there are
degeneracies due e.g. to discretization, we remove them by adding small random fluctuations
to $x_i$). Finally,

$$x'_i = 2 r_i / N - 1, \tag{5}$$

and analogously for $y$. Notice that this does not, strictly speaking, define $X'$, as it
defines the homeomorphism $\phi$ only at the discrete values $x_i$, but this does not pose a
practical problem. Furthermore, in the limit $N \to \infty$ the "empirical $\phi(x)$" tends
with probability 1 towards a true homeomorphism. The linear correlation between the ranks
of $x$ and $y$ is by definition the Spearman coefficient $S = C_{X'Y'}$.
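
The rank transformation of Eq. (5) and the resulting Spearman coefficient can be sketched as follows. This is a minimal illustration using NumPy, assuming that all values are distinct (ties would first have to be broken, e.g. by the small random fluctuations mentioned above); the function names are hypothetical.

```python
import numpy as np


def normalized_ranks(values):
    """Map a sample to x'_i = 2 r_i / N - 1, with r_i the rank of x_i in 1..N.

    Assumes all entries are distinct; ties are not handled here.
    """
    values = np.asarray(values, dtype=float)
    n = values.size
    order = np.argsort(values, kind="stable")
    ranks = np.empty(n, dtype=float)
    ranks[order] = np.arange(1, n + 1)  # r_i = number of x_j <= x_i
    return 2.0 * ranks / n - 1.0


def spearman_coefficient(x, y):
    """Pearson correlation of the rank-transformed variables."""
    return float(np.corrcoef(normalized_ranks(x), normalized_ranks(y))[0, 1])


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=336)
    print(spearman_coefficient(x, np.exp(x)))  # monotonic map, so S is 1 up to rounding
    print(spearman_coefficient(x, x ** 2))     # non-monotonic dependence
```

Because only the ordering of the data enters, any monotonically increasing reparametrization of `x` or `y` leaves the result unchanged.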

To obtain a bound on the MI for given marginal distributions and given first & second
moments, we use the Lagrangian method. Without loss of generality we assume that the data
are centered, i.e. $\langle X\rangle = \langle Y\rangle = 0$. We use $p(x,y)$ as independent
variables, and

$$\begin{aligned}
p_X(x) &= \int dy\; p(x,y), \qquad p_Y(y) = \int dx\; p(x,y), \\
C_{XY} &= \int dx\,dy\; xy\; p(x,y) \,/\, [\sigma_X \sigma_Y]
\end{aligned} \tag{6}$$

as constraints. The Lagrangian function is

$$\begin{aligned}
L = {} & \int dx\,dy\; p(x,y)\,\log\frac{p(x,y)}{p_X(x)\,p_Y(y)} \\
  & + \int dx\; \nu_X(x)\Big[p_X(x) - \int dy\; p(x,y)\Big] \\
  & + \int dy\; \nu_Y(y)\Big[p_Y(y) - \int dx\; p(x,y)\Big] \\
  & + \lambda\Big[\sigma_X \sigma_Y C_{XY} - \int dx\,dy\; xy\; p(x,y)\Big]
\end{aligned} \tag{7}$$

where $\nu_X(x)$, $\nu_Y(y)$, and $\lambda$ are Lagrangian parameters. The variational
equations are

$$\frac{\delta L}{\delta p(x,y)} = \log\frac{p(x,y)}{p_X(x)\,p_Y(y)} + 1 - \nu_X(x) - \nu_Y(y) - \lambda xy = 0, \tag{8}$$

which can also be written as

$$p(x,y) = f_X(x)\, f_Y(y)\, e^{-\lambda (x-y)^2/2} \tag{9}$$

with unknown functions $f_X, f_Y$ and unknown $\lambda$, all of which are determined by the
constraints. Here $xy = [x^2 + y^2 - (x-y)^2]/2$ was used, and all factors depending on $x$
or $y$ alone were absorbed into $f_X$ and $f_Y$. The Kolmogorov consistency condition for
$p_X(x)$, in particular, gives

$$\frac{p_X(x)}{f_X(x)} = \int dy\; f_Y(y)\, e^{-\lambda (x-y)^2/2}. \tag{10}$$

In the following we shall only discuss the two cases of Gaussian and uniform marginals.
For Gaussian marginals, one finds that $p(x,y)$ is also Gaussian, and thus the results of
[3] are obtained,

$$I(X:Y) \ge I_{\rm Gauss}(C_{XY}) \equiv -\frac{1}{2}\log(1 - C_{XY}^2). \tag{11}$$
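
As a small worked example, the Gaussian bound of Eq. (11), $-\frac{1}{2}\log(1 - C_{XY}^2)$ in natural units, can be evaluated directly. The sketch assumes the correlation lies strictly between -1 and 1; the function name is hypothetical.

```python
import math


def gaussian_mi_lower_bound(c_xy: float) -> float:
    """Return -0.5 * log(1 - c_xy**2) in nats (valid for Gaussian marginals)."""
    if not -1.0 < c_xy < 1.0:
        raise ValueError("correlation must lie strictly between -1 and 1")
    # log1p keeps precision when the correlation is small
    return -0.5 * math.log1p(-c_xy * c_xy)


if __name__ == "__main__":
    for c in (0.0, 0.1, 0.5, 0.9, 0.99):
        print(c, gaussian_mi_lower_bound(c))
```

The bound vanishes at zero correlation, is symmetric in the sign of the correlation, and diverges logarithmically as the correlation approaches ±1.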



For uniform marginals, we do not directly solve the problem of finding a bound
$I^-_{\rm unif}$ on the MI for given $S$, but we solve the easier implicit problem of
finding both $I^-_{\rm unif}$ and $S$ for given $\lambda$. We do this recursively, starting
with the zeroth approximation

$$f_X^{(0)}(x) = f_Y^{(0)}(y) = 1/2. \tag{12}$$

From the $k$-th approximation of $f_X$ and $f_Y$ we obtain the $(k+1)$-st approximations by
means of

$$f_X^{(k+1)}(x) = \left[\, 2 \int dy\; f_Y^{(k)}(y)\, e^{-\lambda (x-y)^2/2} \right]^{-1} \tag{13}$$

$$f_Y^{(k+1)}(y) = \left[\, 2 \int dx\; f_X^{(k)}(x)\, e^{-\lambda (x-y)^2/2} \right]^{-1} \tag{14}$$

When doing this, we observe that $f_X^{(k)}$ and $f_Y^{(k)}$ are even functions for each $k$,
and that both indeed are equal. We can thus drop the subscripts and write the recursion as

$$f^{(k+1)}(x) = \left[\, 2 \int_{-1}^{1} dy\; f^{(k)}(y)\, e^{-\lambda (x-y)^2/2} \right]^{-1} \tag{15}$$

   $\lambda$     $S$      $I^-_{\rm unif}$
     0.00      0.0000     0.0000
     0.25      0.0829     0.0034
     0.50      0.1633     0.0135
     0.75      0.2390     0.0292
     1.00      0.3086     0.0495
     1.25      0.3713     0.0729
     1.50      0.4270     0.0984
     2.00      0.5189     0.1517
     2.50      0.5897     0.2040
     3.00      0.6428     0.2531
     4.00      0.7177     0.3396
     5.00      0.7666     0.4123
     6.00      0.8007     0.4746
     7.00      0.8260     0.5292
     8.00      0.8455     0.5777
     9.00      0.8610     0.6215
    10.00      0.8736     0.6614
    11.50      0.8887     0.7156
    13.00      0.9005     0.7636
    15.00      0.9128     0.8208
    17.00      0.9224     0.8717
    20.00      0.9333     0.9389
    23.00      0.9415     0.9975
    27.00      0.9498     1.0657
    32.00      0.9572     1.1393
    40.00      0.9654     1.2366
    50.00      0.9721     1.3357

TABLE I: Spearman coefficient $S$ and lower bound $I^-_{\rm unif}$ on the MI (in natural
units) for several values of $\lambda$.



After convergence, the joint density is obtained as

$$p(x,y) \propto \lim_{k\to\infty} f^{(k)}(x)\, f^{(k)}(y)\, e^{-\lambda (x-y)^2/2}. \tag{16}$$

Here we have left the normalization open, in order to allow for errors in the numerical
integration which might have accumulated during the recursion. The proportionality constant
is thus fixed by the normalization condition $\int p = 1$. Finally, $S$ and the lower bound
$I^-_{\rm unif}(S)$ on $I(X:Y)$ are obtained by using Eq. (1) and

$$S = 3 \int dx\,dy\; xy\; p(x,y), \tag{17}$$

where the factor 3 is $1/(\sigma_X \sigma_Y)$ for uniform marginals on $[-1,1]$.
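
The recursion and the final evaluation of $S$ and $I^-_{\rm unif}$ can be sketched numerically as follows. This is an illustration, not the authors' code: it assumes Gauss-Legendre quadrature on $[-1,1]$ via NumPy, a fixed number of iterations instead of a convergence test, and a kernel of the form $\exp[-\lambda(x-y)^2/2]$ (the convention under which $S \approx \lambda/3$ for small $\lambda$, as in Table I). The function name and its parameters are hypothetical.

```python
import numpy as np


def uniform_marginal_bound(lam, n_nodes=64, n_iter=500):
    """Return (S, I) in nats for uniform marginals on [-1, 1].

    Assumes the joint density has the form f(x) f(y) exp(-lam (x - y)^2 / 2).
    """
    x, w = np.polynomial.legendre.leggauss(n_nodes)  # nodes and weights on [-1, 1]
    kernel = np.exp(-0.5 * lam * (x[:, None] - x[None, :]) ** 2)

    f = np.full(n_nodes, 0.5)                        # zeroth approximation
    for _ in range(n_iter):
        f = 1.0 / (2.0 * (kernel @ (w * f)))         # one step of the recursion

    p = f[:, None] * f[None, :] * kernel             # joint density, unnormalized
    ww = w[:, None] * w[None, :]                     # 2-d quadrature weights
    p /= np.sum(ww * p)                              # enforce int p = 1

    s = 3.0 * np.sum(ww * p * x[:, None] * x[None, :])
    mi = np.sum(ww * p * np.log(4.0 * p))            # marginals are 1/2 each
    return float(s), float(mi)


if __name__ == "__main__":
    for lam in (0.25, 1.0, 4.0):
        print(lam, uniform_marginal_bound(lam))
```

If the assumed exponent convention is right, `lam = 1.0` should land close to the Table I entry ($S \approx 0.3086$, $I \approx 0.0495$). For large `lam` the kernel becomes narrow, so more nodes and iterations would be needed than the defaults chosen here.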



Numerical results for several values of $\lambda$, obtained by using Gaussian quadrature
for the integrals, are given in Table I. Except for values of $S$ close to $\pm 1$,
$I^-_{\rm unif}(S)$ is well approximated by

$$I^-_{\rm unif}(S) \approx -\frac{1}{2}\left(1 - 0.122\, S^2 + 0.053\, S^{12}\right) \log(1 - S^2). \tag{18}$$

The two bounds for Gaussians [Eq. (11)] and for uniform distributions [Eq. (18)] are shown
in Fig. 1.
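
For practical use, the fit of Eq. (18), $-\frac{1}{2}(1 - 0.122\,S^2 + 0.053\,S^{12})\log(1-S^2)$, is easy to evaluate. The sketch below is a direct transcription of that formula in natural units; the function name is hypothetical, and the fit should not be trusted for $S$ very close to ±1.

```python
import math


def uniform_mi_lower_bound(s: float) -> float:
    """Approximate lower bound on the MI (nats) for Spearman coefficient s."""
    if not -1.0 < s < 1.0:
        raise ValueError("Spearman coefficient must lie strictly between -1 and 1")
    s2 = s * s
    correction = 1.0 - 0.122 * s2 + 0.053 * s ** 12
    return -0.5 * correction * math.log1p(-s2)


if __name__ == "__main__":
    # Compare with the tabulated pairs (S, bound) of Table I
    for s, tabulated in ((0.3086, 0.0495), (0.7177, 0.3396), (0.9333, 0.9389)):
        print(s, uniform_mi_lower_bound(s), tabulated)
```

A natural application is a sanity check on MI estimators: an estimate that falls clearly below `uniform_mi_lower_bound(S)` for the measured Spearman coefficient points to estimator bias or statistical fluctuations rather than to a property of the data.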




FIG. 1: (color online) Lower bounds of MI in terms of the Spearman correlation coefficient
[$I^-_{\rm unif}(S)$, Eq. (18); continuous line, red] and in terms of the Pearson
correlation coefficient in case of Gaussian marginals [$I_{\rm Gauss}(C)$, Eq. (11);
dashed, green]. For both curves, the MI is measured in natural units.

FIG. 2: (color online) Mutual informations between gene BCL6 and all the other 12599 genes
as measured in the microarray gene expression experiment of [5]. Values of the MI were
estimated by means of $k$-nearest neighbors with $k = 40$. The green line is the lower bound
discussed in this paper.



As an application we show in Fig. 2 gene expression data obtained from human B lymphocyte
cells [5]. In that experiment, the expressions of 12600 different gene loci were measured
in 336 different conditions, with special interest in tumor cells. For each pair of genes
the data can thus be represented as 336 points in a two-dimensional plane. Spearman
coefficients were obtained by ranking both coordinates (after disambiguating degeneracies
by adding low level noise as explained above). Mutual informations were estimated using the
$k$-nearest neighbor method of [3] with $k = 40$. Although this was done for all
12600 × 12599/2 pairs, only results for the 12599 pairs involving the important cancer gene
BCL6 are shown in Fig. 2. We can make the following observations:

• The bound is respected by most pairs, and it forms roughly a lower envelope for the
distribution.

• There are several pairs for which the bound is violated, mostly for small values of $S$.
This reflects the fact that the MI estimator is not perfect. Indeed, no MI estimator can
be perfect. Most estimators are chosen such that they never produce negative MI, which is
achieved by tolerating a positive bias. The estimator of [3] was constructed such that the
bias is minimized, at the cost of obtaining occasionally negative values due to statistical
fluctuations.

• For most pairs the bound is not saturated, showing that there are important non-linear
dependencies between these pairs. As an illustration for the latter we take the two points
with $|S| < 0.1$ and $I > 0.3$ and plot their gene expression vectors in Fig. 3. They show
the co-expression of BCL6 with the genes with GenBank accession numbers AA978353 (top) and
AL079277 (bottom). In both panels of Fig. 3 we see very strong dependencies which cannot be
approximated by linear correlations. Neither of these two genes is known to be related to
BCL6, maybe because such relations were overlooked because of the small linear correlations.
The data suggest the presence of (at least) two different sub-populations of cells, marked
in Fig. 3 by different colors. In the sub-population in which BCL6 is strongly expressed
(red points in Fig. 3) there are also significant linear correlations.

FIG. 3: (color online) Each panel shows the gene expression intensities (arbitrary units)
of two genes, one of which is BCL6 (x-axis). The other gene (y-axis) was chosen so as to
have very large MI with BCL6, but very small Spearman coefficient (the uppermost two points
in Fig. 2 with $|S| < 0.1$). The color coding (green for BCL6 expression < 200 and AL079277
expression < 500, red otherwise) is such that the same cell conditions have in both panels
the same color. It suggests that the observed nonlinear correlations are related to the
existence of two cell populations with very different properties. The two genes correspond
to accession numbers AA978353 (top) and AL079277 (bottom).

In summary, we have derived lower bounds on the MI between real-valued variables in terms
of linear correlation coefficients. We have seen that such bounds are not independent of
the marginal distribution, in contrast to the claims made in the appendix of [3]. But one
can use the homeomorphism invariance of the MI to transform the variables to new variables
with uniform distribution, in which case the linear correlation coefficient becomes equal
to the Spearman coefficient $S$. At least in one specific and scientifically relevant
example, the resulting bound of the MI in terms of $S$ was found to be numerically
non-trivial. In particular, large discrepancies between the bound and the actual values
gave hints to specific structures in the data which then could be investigated in more
detail. The bound can also be useful in testing MI estimators. Usually, an estimator is
deemed unacceptable if it violates the bound $I(X:Y) \ge 0$. But it would be equally
unacceptable if it violated the stronger bound $I(X:Y) \ge I^-$.

Finally, our results also answer the question of how linear correlations change under
reparametrizations. There is no reason to expect a universal exact answer, but
approximately they should change such that the numerical values of the bounds $I^-$ stay
the same.

We thank Andrea Califano for providing us the data of Ref. [5], and Alexander Kraskov and
Maya Paczuski for discussions.